Abstract. Traditional parallel schedulers running on cluster supercomputers support
only static scheduling, where the number of processors allocated to an application
remains fixed throughout the execution of the job. This results in under-utilization
of idle system resources, thereby decreasing overall system throughput.
In our research, we have developed a prototype framework called ReSHAPE,
which supports dynamic resizing of parallel MPI applications executing on distributed
memory platforms. The resizing library in ReSHAPE includes support
for releasing and acquiring processors and efficiently redistributing application
state to a new set of processors. In this paper, we derive an algorithm for redistributing
two-dimensional block-cyclic arrays from P to Q processors, organized
as 2-D processor grids. The algorithm ensures a contention-free communication
schedule for data redistribution if P_r ≤ Q_r and P_c ≤ Q_c. In other cases, the
algorithm implements circular row and column shifts on the communication schedule
to minimize node contention.
1 Introduction

As terascale supercomputers become more common and as the high-performance com- 
puting (HPC) community turns its attention to petascale machines, the challenge of 
providing effective resource management for high-end machines grows in both impor- 
tance and difficulty. A fundamental problem is that conventional parallel schedulers are 
static, i.e., once a job is allocated a set of resources, they remain fixed throughout the 
life of an application's execution. It is worth asking whether a dynamic resource man- 
ager, which has the ability to modify resources allocated to jobs at runtime, would allow 
more effective resource management. The focus of our research is on dynamically
reconfiguring parallel applications to use a different number of processes, i.e., on dynamic
resizing of applications.

In order to explore the potential benefits and challenges of dynamic resizing, we are 
developing ReSHAPE, a framework for dynamic Resizing and Scheduling of Homo- 
geneous Applications in a Parallel Environment. The ReSHAPE framework includes a 
programming model and an API, data redistribution algorithms and a runtime library,
and a parallel scheduling and resource management system framework. ReSHAPE al- 
lows the number of processors allocated to a parallel message-passing application to be 
changed at run time. It targets long-running iterative computations, i.e., homogeneous 
computations that perform similar computational steps over and over again. By moni- 
toring the performance of such computations on various processor sizes, the ReSHAPE 
scheduler can take advantage of idle processors on large clusters to improve the
turnaround time of high-priority jobs, or shrink low-priority jobs to meet quality-of-service
or advanced reservation commitments.

Dynamic resizing necessitates runtime application data redistribution. Many high
performance computing applications and mathematical libraries like ScaLAPACK [1]
require block-cyclic data redistribution to achieve computational efficiency. Data
redistribution involves four main stages: data identification and index computation,
communication schedule generation, message packing and unpacking, and finally data
transfer. Each processor identifies its part of the data to redistribute and transfers the
data in the message passing step according to the order specified in the communication
schedule. Node contention occurs when two or more processors send messages to a
single processor in the same communication step. A redistribution communication
schedule aims to minimize node contention and maximize network bandwidth utilization.
Data is packed or marshalled on the source processor to form a message and is
unmarshalled on the destination processor.

In this paper, we present an algorithm for redistributing two-dimensional block-cyclic
data from P (P_r rows × P_c columns) to Q (Q_r rows × Q_c columns) processors,
organized as 2-D processor grids. We evaluate the algorithm's performance by measuring
the redistribution time for different block-cyclic matrices. If P_r ≤ Q_r and P_c ≤ Q_c,
the algorithm ensures a contention-free communication schedule for redistributing data
from the source processor set P to the destination processor set Q. In other cases the
algorithm minimizes node contention by performing row or column circular shifts on the
communication schedule. The algorithm discussed in this paper supports 2-D block-cyclic
data redistribution only for one- and two-dimensional processor topologies. We also
discuss in detail the modifications needed to port an existing scientific application to
use the dynamic resizing capability of ReSHAPE using the API provided by the framework.

The rest of the paper is organized as follows: Section 2 discusses prior work in the
area of data redistribution. Section 3 briefly reviews the architecture of the ReSHAPE
framework and discusses in detail the two-dimensional redistribution algorithm and the
ReSHAPE API. Section 4 reports our experimental results of the redistribution algorithm
with the ReSHAPE framework tested on the SystemX cluster at Virginia Tech.
We conclude in Section 5 by discussing future directions for this research.

2 Related Work

Data redistribution within a cluster using message passing approach has been exten- 
sively studied in literature. Many of the past research efforts 12 ID H 10 Q 
El 13 ifTUll ifTTI |[T2l were targeted towards redistributing cyclically distributed one 
dimensional arrays between the same set of processors within a cluster on a 1-D
processor topology. To reduce the redistribution overhead cost, Walker and Otto [12] and
Kaushik [7] proposed a K-step communication schedule based on modulo arithmetic
and tensor products, respectively. Ramaswamy and Banerjee [9] proposed a redistribution
technique, PITFALLS, that uses line segments to map array elements to a processor.
This algorithm can handle any arbitrary number of source and destination processors.
However, this algorithm does not use communication schedules during redistribution,
resulting in node contention during data transfer. Thakur et al. [10, 11] use gcd and
lcm methods for redistributing cyclically distributed one-dimensional arrays on the same
processor set. The algorithms described by Thakur et al. [10] and Ramaswamy and
Banerjee [9] use a series of one-dimensional redistributions to handle multidimensional
arrays. This approach can result in significant redistribution overhead cost due to
unwanted communication. Kalns and Ni [6] presented a technique for mapping data to
processors by assigning logical processor ranks to the target processors. This technique
reduces the total amount of data that must be communicated during redistribution.
Hsu et al. [5] further extended this work and proposed a generalized processor mapping
technique for redistributing data from cyclic(kx) to cyclic(x), and vice versa. Here, x
denotes the number of data blocks assigned to each processor. However, this method is
applicable only when the numbers of source and target processors are the same.
Chung et al. [2] proposed an efficient method for index computation using the
basic-cycle calculation (BCC) technique for redistributing data from cyclic(x) to
cyclic(y) on the same processor set. An extension of this work by Hsu et al. [13] uses
the generalized basic-cycle calculation method to redistribute data from cyclic(x) over
P processors to cyclic(y) over Q processors. The generalized BCC uses a bipartite
matching approach for data redistribution. Lim et al. [8] developed a redistribution
framework that could redistribute a one-dimensional array from one block-cyclic scheme
to another on the same processor set using a generalized circulant matrix formalism.
Their algorithm applies row and column transformations on the communication schedule
matrix to generate a conflict-free schedule.

Prylli et al. [14], Desprez et al. [3] and Lim et al. [15] proposed efficient algorithms
for redistributing one- and two-dimensional block-cyclic arrays. Prylli et al. [14]
proposed a simple scheduling algorithm, called Caterpillar, for redistributing data
across a two-dimensional processor grid. At each step d in the algorithm, processor
P_i (0 ≤ i < P) in the destination processor set exchanges its data with processor
P_((P - i - d) mod P). The Caterpillar algorithm does not have global knowledge of the
communication schedule and redistributes the data using the local knowledge of the
communications at every step. As a result, this algorithm is not efficient for data
redistribution using "non-all-to-all" communication. Also, the redistribution time for
a step is the time taken to transfer the largest message in that step. Desprez et
al. [3] proposed a general solution for redistributing one-dimensional block-cyclic
data from a cyclic(x) distribution on a P-processor grid to a cyclic(y) distribution on
a Q-processor grid for arbitrary values of P, Q, x, and y. The algorithm assumes the
source and target processors to be disjoint sets and uses bipartite matching to compute
the communication schedule. However, this algorithm does not ensure a contention-free
communication schedule. In a recent work, Guo and Pan [4] described a method to
construct schedules that minimize the number of communication steps, avoid node
contention, and minimize the effect of differences in message length in each
communication step. Their algorithm focuses on redistributing one-dimensional data from
a cyclic(kx) distribution on P processors to a cyclic(x) distribution on Q processors
for any arbitrary positive values of P and Q. Lim et al. [15] propose an algorithm for
redistributing a two-dimensional block-cyclic array across a two-dimensional processor
grid. But the algorithm is restricted to redistributing data across different processor
topologies on the same processor set. Park et al. [16] extended the idea described by
Lim et al. [15] and proposed an algorithm for redistributing a one-dimensional
block-cyclic array with cyclic(x) distribution on P processors to cyclic(kx) on Q
processors, where P and Q can be any arbitrary positive values.
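
The Caterpillar pairing rule quoted above can be illustrated with a short sketch. This is only an illustration of the formula (P - i - d) mod P as stated in the text, not the implementation of Prylli et al.; the function names and the choice to iterate d over 0..P-1 are assumptions made for the example.

```python
def caterpillar_partner(i, d, p):
    """Partner of processor i at step d among p processors: (p - i - d) mod p."""
    return (p - i - d) % p


def caterpillar_steps(p):
    """For each step d in 0..p-1 (assumed range), list the exchange pairs (i, partner)."""
    steps = []
    for d in range(p):
        pairs = {tuple(sorted((i, caterpillar_partner(i, d, p)))) for i in range(p)}
        steps.append(sorted(pairs))
    return steps


# With four processors, step 0 pairs 1 with 3; processors 0 and 2 are paired with themselves.
for d, pairs in enumerate(caterpillar_steps(4)):
    print(d, pairs)
```

The pairing is symmetric (the partner of a processor's partner is the processor itself), which is why each step can be carried out as a set of pairwise exchanges using only local knowledge.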

To summarize, most of the existing approaches deal with redistribution of a
block-cyclic array across a one-dimensional processor topology, either on the same or
on a different processor set. The Caterpillar algorithm by Prylli et al. [14] is the
closest related work to our redistribution algorithm in that it supports redistribution
on a checkerboard processor topology. In our work, we extend the idea in [15, 16] to
develop an algorithm to redistribute two-dimensional block-cyclic data distributed
across a 2-D processor grid topology. The data is redistributed from P (P_r × P_c) to
Q (Q_r × Q_c) processors, where P and Q can be any arbitrary positive values. Unlike
Desprez et al. [3], who assume that there is no overlap among processors in the source
and destination processor sets, our work does not make this assumption. Our algorithm
builds an efficient communication schedule and uses non-all-to-all communication for
data redistribution. We apply row and column transformations using the circulant matrix
formalism to minimize node contention in the communication schedule.

3 System Overview 

The ReSHAPE framework, shown in Figure 1(a), consists of two main components. The
first component is the application scheduling and monitoring module, which schedules
and monitors jobs and gathers performance data in order to make resizing decisions
based on application performance, available system resources, resources allocated to
other jobs in the system, and jobs waiting in the queue. The second component of the
framework consists of a programming model for resizing applications. This includes
a resizing library and an API for applications to communicate with the scheduler to
send performance data and actuate resizing decisions. The resizing library includes
algorithms for mapping processor topologies and redistributing data from one processor
topology to another. The individual components in these modules are explained in detail
by Sudarsan and Ribbens [17].

3.1 Resizing library 

The resizing library provides routines for changing the size of the processor set assigned
to an application and for mapping processors and data from one processor set to another.
An application needs to be re-compiled with the resize library to enable the scheduler to
dynamically add or remove processors to/from the application. During resizing, rather
than suspending the job, the application execution control is transferred to the resize
library, which maps the new set of processors to the application and redistributes the
data (if required). Once mapping is completed, the resizing library returns control back
to the application and the application continues with its next iteration. The application
user needs to indicate the global data structures and variables so that they can be
redistributed to the new processor set after resizing. Figure 1(b) shows the different
stages of execution required for changing the size of the processor set for an application.

Fig. 1. (a) Architecture of ReSHAPE (b) State diagram for application expansion and
shrinking

Our API gives programmers a simple way to indicate resize points in the application,
typically at the end of each iteration of the outer loop. At resize points, the application
contacts the scheduler and provides performance data to the scheduler. The metric used
to measure performance is the time taken to compute each iteration. The scheduler's
decision to expand or shrink the application is passed as a return value. If an application
is allowed to expand to more processors, the response from the Remap Scheduler includes
the size and the list of processors to which an application should expand. A call to the
redistribution routine remaps the global data to the new processor set. If the Scheduler
asks an application to shrink, then the application first redistributes its global data
across a smaller processor set, retrieves its previously stored MPI communicator, and
creates a new BLACS [18] context for the new processor set. The additional processes
are terminated when the old BLACS context is exited. The resizing library notifies the
Remap Scheduler about the number of nodes relinquished by the application.



3.2 Application Programming Interface (API) 

A simple API allows user codes to access the ReSHAPE framework and library. The
core functionality is accessed through the following internal and external interfaces,
which are available for use by advanced application programmers. These functions
provide the main functionality of the resizing library by contacting the scheduler,
remapping the processors after an expansion or a shrink, and redistributing the
data. They are listed as follows:

- reshape_Initialize (global data array, nprocessors, blacs_context, iterationCount,
processor_row, processor_column, job_id): initializes the iterationCount and the
global data array with the initial values and creates a blacs_context for the
two-dimensional processor topology. The function returns values for processor row,
column configuration and job_id.

- reshape_ContactScheduler (iteration_time, redistribution_time, processor_row_count,
processor_column_count, job_id): contacts the scheduler and supplies the last iteration
time; on return, the scheduler indicates whether the application should expand,
shrink, or continue execution with the current processor size.

- reshape_Expand (): adds the new set of processors (defined by the previous call to
reshape_ContactScheduler) to the current set using BLACS.

- reshape_Shrink (): reduces the processor set size (defined by the previous call to
reshape_ContactScheduler) to an earlier configuration and relinquishes additional
processors.

- reshape_Redistribute (global data array, current BLACS context, current processor
set size, EXPAND/SHRINK): redistributes global data among the newly spawned or
shrunk processors. The redistribution time is computed and stored for the next resize
point.

- reshape_Log (starttime, endtime): computes the average iteration time of the current
iteration for all the processors and stores it for the next resize point.

Figure 2(a) shows the source code for a simple MPI application for solving a
sequence of linear systems of equations using ScaLAPACK functions. The original code
was refactored to identify the global data structures and variables. The ReSHAPE API
calls were inserted at the appropriate locations in the refactored code. Figure 2(b) shows
the modified code.

3.3 Data Redistribution 

The data redistribution library in ReSHAPE uses an efficient algorithm for redistributing
block-cyclic arrays between processor sets organized in a 1-D (row or column format)
or checkerboard processor topology. The algorithm for redistributing a 1-D block-cyclic
array over a one-dimensional processor topology was first proposed by Park et
al. [16]. We extend this idea to develop an algorithm to redistribute both one- and
two-dimensional block-cyclic data across a two-dimensional processor grid.
In our redistribution algorithm, we assume the following:

- Source processor configuration: P_r × P_c (rows × columns), P_r, P_c > 0.

- Destination processor configuration: Q_r × Q_c (rows × columns), Q_r, Q_c > 0.

- The data granularity is set at the block level, i.e., a block is the smallest unit of
data that will be transferred and cannot be further subdivided. This block size is
specified by the user.

- The data matrix, data, which needs to be redistributed, is of dimension n × n.

- Let the block size be NB. Therefore the total number of data blocks = (n/NB) *
(n/NB) = N × N, represented using matrix Mat.

- We use Mat(x, y) to refer to block(x, y), 0 ≤ x, y < N.

- The data can be equally divided among the source and destination processors P
and Q respectively, i.e., N is evenly divisible by P_r, P_c, Q_r, and Q_c. Each processor
has an integer number of data blocks.

- The source processors are numbered P(i,j), 0 ≤ i < P_r, 0 ≤ j < P_c, and the
destination processors are numbered Q(i,j), 0 ≤ i < Q_r, 0 ≤ j < Q_c.


(a)

    int main(int argc, char **argv) {
        double **A, **B;
        int maxIterations = 10;
        //MPI Initializations

        //Read Global matrix A of dimensions m x n, B with dimensions n x p

        for (int iterationCount = 0; iterationCount < maxIterations; iterationCount++)
        {
            //Compute descriptor and other parameters for PDGETRF and PDGETRS

            //Solve linear system of equations using LU factorization
        }
    }

(b)

    /* Identification of global arrays and variables */
    double **A, **B;
    int maxIterations = 10;
    int blacs_context, iterationCount, nprocessor_row,
        nprocessor_column, job_id;
    int iteration_time, redistribution_time;

    int main(int argc, char **argv) {
        //MPI Initialization

        //Read Global matrix A of dimensions m x n, B with dimensions n x p

        reshape_Initialize(A, size, blacs_context, iterationCount, nprocessor_row,
                           nprocessor_column, job_id);
        reshape_Initialize(B, size, blacs_context, iterationCount, nprocessor_row,
                           nprocessor_column, job_id);

        compute(); //Refactoring the original code
    }

    void compute() {
        for (; iterationCount < maxIterations; iterationCount++)
        {
            //Read array dimensions
            //Compute descriptor and other parameters for PDGETRF and PDGETRS
            start = MPI_Wtime();
            //Solve linear system of equations by performing LU factorization
            end = MPI_Wtime();

            reshape_Log(start, end);
            return_scheduler_decision =
                reshape_ContactScheduler(iteration_time, redistribution_time,
                                         nprocessor_row, nprocessor_column, job_id);
            if (return_scheduler_decision == EXPAND)
            {
                reshape_Expand();
                reshape_Redistribute(A, blacs_context, nprocessor_row,
                                     nprocessor_column, EXPAND);
                reshape_Redistribute(B, blacs_context, nprocessor_row,
                                     nprocessor_column, EXPAND);
            }
            else if (return_scheduler_decision == SHRINK)
            {
                reshape_Shrink();
                reshape_Redistribute(A, blacs_context, nprocessor_row,
                                     nprocessor_column, SHRINK);
                reshape_Redistribute(B, blacs_context, nprocessor_row,
                                     nprocessor_column, SHRINK);
            }
        }
    }

Fig. 2. (a) Original MPI code for solving system of linear equations, (b) Code modified
for resizing using ReSHAPE's API



Problem Definition. We define 2-D block-cyclic distribution as follows: given a
two-dimensional array of n × n elements with block size NB and a set of P processors
arranged in checkerboard topology, the data is partitioned into N × N blocks and
distributed across P processors, where N = n/NB. Using this distribution, a matrix block
Mat(x, y) is assigned to the source processor P_c * (x % P_r) + (y % P_c), 0 ≤ x < N,
0 ≤ y < N. Here we study the problem of redistributing a two-dimensional block-cyclic
matrix from P processors to Q processors arranged in checkerboard topology,
where P ≠ Q and NB is fixed. After redistribution, the block Mat(x, y) will belong
to the destination processor Q_c * (x % Q_r) + (y % Q_c), 0 ≤ x < N, 0 ≤ y < N.
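
The ownership rule in the problem definition can be written as a small function. The following is a minimal sketch, assuming processors are identified by their row-major rank on the grid; the names owner and owner_table are hypothetical and not part of ReSHAPE.

```python
def owner(x, y, grid_rows, grid_cols):
    """Rank of the processor owning block Mat(x, y) on a grid_rows x grid_cols grid."""
    return grid_cols * (x % grid_rows) + (y % grid_cols)


def owner_table(n_blocks, grid_rows, grid_cols):
    """Owner rank of every block of an n_blocks x n_blocks block matrix."""
    return [[owner(x, y, grid_rows, grid_cols) for y in range(n_blocks)]
            for x in range(n_blocks)]


# Source grid P = 2 x 2 and destination grid Q = 3 x 4, with N = 12 blocks per side.
source = owner_table(12, 2, 2)
destination = owner_table(12, 3, 4)
print(source[0][:4], destination[0][:4])
```

For instance, block Mat(4, 2) is owned by rank 0 on the 2 × 2 source grid and by rank 6, that is Q(1,2), on the 3 × 4 destination grid.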
Fig. 3. (a) P = 4 (2 × 2), Q = 12 (3 × 4): data layout in source and destination processors,
(b) Creation of the Communication Schedule (C_Transfer) from the Initial Data-Processor
Configuration table (IDPC) and the Final Data-Processor Configuration table (FDPC)



Redistribution Terminologies. 

(a) Superblock: Figure 3(a) shows the checkerboard distribution of an 8 × 6 block-cyclic
data on source and destination processor grids. The b00 entry in the source
layout table indicates that the block of data is owned by processor P(0,0), the block
denoted by b01 is owned by processor P(0,1), and so on. The numbers in the top
right corner of every block indicate the id of that data block. From this data layout,
a periodic pattern can be identified for redistributing data from source to destination
layout. The blocks Mat(0,0), Mat(0,2), Mat(2,0), Mat(2,2), Mat(4,0)
and Mat(4,2), owned by processor P(0,0) in the source layout, are transferred to
processors Q(0,0), Q(0,2), Q(2,0), Q(2,2), Q(1,0) and Q(1,2). This mapping pattern
repeats itself for blocks Mat(0,4), Mat(0,6), Mat(2,4), Mat(2,6), Mat(4,4)
and Mat(4,6). Thus we can see that the communication pattern of the blocks
Mat(i,j), 0 ≤ i ≤ 5, 0 ≤ j < 4, repeats for other blocks in the data. A superblock
is defined as the smallest set of data blocks whose mapping pattern from source to
destination processor can be uniquely identified. For a 2-D processor topology data
distribution, each superblock is represented as a table of R rows and C columns,
where

R = lcm(P_r, Q_r)     C = lcm(P_c, Q_c)

The entire data is divided into multiple superblocks and the mapping pattern of
the data in each superblock is identical to the first superblock, i.e., the data blocks
located at the same relative position in all the superblocks are transferred to the
same destination processor. A 2-D block matrix with Sup elements is used to represent
the entire data, where each element is a superblock. The dimensions of this block
matrix are Sup_R and Sup_C, where

Sup_R = N/R     Sup_C = N/C     Sup = (N/R) * (N/C)

(b) Layout: Layout is a 1-D array of Sup_R * Sup_C elements where each element
is a 2-D table which stores the block ids present in that superblock. There are Sup
2-D tables in the Layout array, where each table has the dimension R × C.

(c) Initial Data-Processor Configuration (IDPC): This table represents the initial
processor layout for the data before redistribution for a single superblock. Since
the data-processor mapping is identical over all the superblocks, only one instance
of this table is created. The table has R rows × C columns. IDPC(i,j) contains
the processor id P(i,j) that owns the block Mat(i,j) located at the same relative
position in all the superblocks (0 ≤ i < R, 0 ≤ j < C).

(d) Final Data-Processor Configuration (FDPC): The table represents the final
processor configuration for the data layout after redistribution for a single superblock.
Like IDPC, only one instance of this table is created and used for all the data
superblocks. The dimensions of this table are R × C. FDPC(i,j) contains the processor
id Q(i,j) that owns the block Mat(i,j) after redistribution, located at the same
relative position in all the superblocks (0 ≤ i < R, 0 ≤ j < C).

(e) The source processor for any data block Mat(i,j) in the data matrix can be computed
using the formula

Source(i, j) = P_c * (i % P_r) + (j % P_c)

(f) Communication schedule send table (C_Transfer): This table contains the final
communication schedule for redistributing data from source to destination layout.
This table is created by re-ordering the FDPC table. The columns of C_Transfer
correspond to the P source processors and the rows correspond to individual
communication steps in the schedule. The number of rows in this table is determined by
(R * C)/P. The network bandwidth is completely utilized in every communication
step as the schedule involves all the source processors in data transfer. A non-negative
entry in the C_Transfer table indicates that in the i-th communication step, processor
j will send data to C_Transfer(i, j), 0 ≤ i < (R * C)/P, 0 ≤ j < (P_r * P_c).

(g) Communication schedule receive table (C_Recv): This table is derived from the
C_Transfer table, where the columns correspond to the destination processors. The
table has the same number of rows as the C_Transfer table. A non-negative entry at
C_Recv(i, j) indicates that processor j will receive data from the source processor at
C_Recv(i, j) in the i-th communication step, 0 ≤ i < (R * C)/P, 0 ≤ j < (Q_r * Q_c).
If (Q_r * Q_c) > (P_r * P_c), then the additional entries in the C_Recv table are filled
with -1.
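
As a worked example of the quantities defined above, the following sketch computes the superblock dimensions and the number of communication steps. It assumes, as the paper does, that N is divisible by both R and C; the function name and the dictionary keys are hypothetical.

```python
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def schedule_sizes(n_blocks, p_grid, q_grid):
    """Superblock shape, superblock counts and number of steps for P -> Q."""
    (p_r, p_c), (q_r, q_c) = p_grid, q_grid
    R, C = lcm(p_r, q_r), lcm(p_c, q_c)
    if n_blocks % R or n_blocks % C:
        raise ValueError("N must be divisible by R and C")
    sup_r, sup_c = n_blocks // R, n_blocks // C
    return {"R": R, "C": C, "Sup_R": sup_r, "Sup_C": sup_c,
            "Sup": sup_r * sup_c, "steps": (R * C) // (p_r * p_c)}


# P = 2 x 2, Q = 3 x 4 and N = 12 blocks per side.
print(schedule_sizes(12, (2, 2), (3, 4)))
```

With P = 2 × 2 and Q = 3 × 4 the superblock is 6 × 4, a 12 × 12 block matrix contains six superblocks, and the schedule has (6 * 4)/4 = 6 communication steps.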

Algorithm. 

Step 1: Create Layout table

The Layout array of tables is created by traversing all the data blocks in
matrix Mat(i,j), where 0 ≤ i < N, 0 ≤ j < N. The superblocks in Mat(i,j)
are traversed in row-major format.
Pseudocode:

    for superblockcount <- 0 to Sup - 1 do
        for i <- 0 to R/P_r - 1 do
            for j <- 0 to C/P_c - 1 do
                for k <- 0 to P_r - 1 do
                    for l <- 0 to P_c - 1 do
                        Layout[superblockcount](i * P_r + k, j * P_c + l) =
                            Mat(superblockid_row * R + i * P_r + k,
                                superblockid_col * C + j * P_c + l)
        if (reached end of column) then
            increment superblockid_row
            superblockid_col <- 0
        else
            increment superblockid_col
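
Step 1 can be sketched in a few lines. This is an illustrative version only: it assumes that block Mat(x, y) has the row-major id x * N + y, which the paper does not state, and the function name build_layout is hypothetical.

```python
def build_layout(n_blocks, R, C):
    """Return Layout: one R x C table of block ids per superblock (row-major order)."""
    sup_r, sup_c = n_blocks // R, n_blocks // C
    layout = []
    for sb_row in range(sup_r):          # superblocks are visited row by row
        for sb_col in range(sup_c):
            table = [[(sb_row * R + i) * n_blocks + (sb_col * C + j)
                      for j in range(C)]
                     for i in range(R)]
            layout.append(table)
    return layout


# N = 12 blocks per side with 6 x 4 superblocks gives 2 * 3 = 6 tables.
layout = build_layout(12, 6, 4)
print(len(layout), layout[1][0])
```

Entry (i, j) of every table refers to the block at the same relative position within its superblock, which is what later allows one message to be assembled per position.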

Step 2: Creating IDPC and FDPC tables

An entry at IDPC(i,j) is calculated using the indices i and j of the table and the
size of the source processor set P, 0 ≤ i < R, 0 ≤ j < C. The Source function
returns the processor id of the owner of the data before redistribution stored in that
location.

Similarly, an entry FDPC(i,j) is computed using the i and j coordinates of the
table and the size of the destination processor set Q, 0 ≤ i < R, 0 ≤ j < C.
The Source function returns the processor id of the owner of the redistributed data
stored in that location.
Pseudocode:

    for i <- 0 to R - 1 do
        for j <- 0 to C - 1 do
            IDPC(i,j) <- Source(i,j) = P_c * (i % P_r) + (j % P_c)

    for i <- 0 to R - 1 do
        for j <- 0 to C - 1 do
            FDPC(i,j) <- Source(i,j) = Q_c * (i % Q_r) + (j % Q_c)
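
The two tables of Step 2 differ only in the grid dimensions used, so a single helper can build both. The sketch below is a minimal illustration; config_table and idpc_fdpc are hypothetical names, and processors are identified by row-major rank.

```python
from math import gcd


def config_table(rows, cols, grid_rows, grid_cols):
    """Owner rank at each superblock position (i, j) for the given processor grid."""
    return [[grid_cols * (i % grid_rows) + (j % grid_cols) for j in range(cols)]
            for i in range(rows)]


def idpc_fdpc(p_grid, q_grid):
    """Build IDPC and FDPC for one superblock of size lcm(P_r,Q_r) x lcm(P_c,Q_c)."""
    (p_r, p_c), (q_r, q_c) = p_grid, q_grid
    R = p_r * q_r // gcd(p_r, q_r)
    C = p_c * q_c // gcd(p_c, q_c)
    return config_table(R, C, p_r, p_c), config_table(R, C, q_r, q_c)


idpc, fdpc = idpc_fdpc((2, 2), (3, 4))
for src_row, dst_row in zip(idpc, fdpc):
    print(src_row, dst_row)
```

For P = 2 × 2 and Q = 3 × 4 the IDPC rows alternate between [0, 1, 0, 1] and [2, 3, 2, 3], while the FDPC rows cycle through the three destination grid rows.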

Step 3: Communication schedule tables (C_Transfer and C_Recv)

The C_Transfer table stores the final communication schedule for transferring data
between the source and the destination processors. The columns in C_Transfer
correspond to the source processors P(i,j). The table has C_TransferRows rows and
(P_r * P_c) columns, where

C_TransferRows = (R * C) / (P_r * P_c)

Each entry in the C_Transfer table is filled by sequentially traversing the FDPC
table in row-major format. The data corresponding to each processor is inserted in the
appropriate column at the next available location. An integer counter per processor
keeps track of the next available location (next row) for that processor.
Pseudocode:

    processor_id = IDPC(i, j)
    C_Transfer(counter[processor_id], processor_id) <- FDPC(i, j)
    Update counter[processor_id]

where 0 ≤ i < R and 0 ≤ j < C. Each row in the C_Transfer table forms
a single communication step where all the source processors send the data to a
unique destination processor. The C_Recv table is used by the destination processors
to know the source of their data in a particular communication step.

C_Recv(i, C_Transfer(i, j)) = j

where 0 ≤ i < C_TransferRows and 0 ≤ j < (P_r * P_c).
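
The construction of the two schedule tables can be traced with the example of Fig. 3 (P = 2 × 2, Q = 3 × 4). The following sketch follows the row-major traversal described above; the function names are hypothetical and ranks are row-major grid positions.

```python
def config_table(rows, cols, grid_rows, grid_cols):
    """Owner rank at each superblock position (i, j) for the given processor grid."""
    return [[grid_cols * (i % grid_rows) + (j % grid_cols) for j in range(cols)]
            for i in range(rows)]


def build_transfer(idpc, fdpc, n_src):
    """Fill C_Transfer: column = source rank, row = communication step."""
    n_rows, n_cols = len(idpc), len(idpc[0])
    steps = (n_rows * n_cols) // n_src
    c_transfer = [[-1] * n_src for _ in range(steps)]
    counter = [0] * n_src                # next free step for each source
    for i in range(n_rows):              # row-major walk over FDPC
        for j in range(n_cols):
            src = idpc[i][j]
            c_transfer[counter[src]][src] = fdpc[i][j]
            counter[src] += 1
    return c_transfer


def build_recv(c_transfer, n_dst):
    """Invert C_Transfer: C_Recv[step][dst] = src, unused entries stay -1."""
    c_recv = [[-1] * n_dst for _ in c_transfer]
    for step, row in enumerate(c_transfer):
        for src, dst in enumerate(row):
            c_recv[step][dst] = src
    return c_recv


idpc = config_table(6, 4, 2, 2)          # R = 6, C = 4
fdpc = config_table(6, 4, 3, 4)
c_transfer = build_transfer(idpc, fdpc, n_src=4)
c_recv = build_recv(c_transfer, n_dst=12)
for step, row in enumerate(c_transfer):
    print(step, row)
```

In this example the column for source rank 0 reads 0, 2, 8, 10, 4, 6, matching the destinations Q(0,0), Q(0,2), Q(2,0), Q(2,2), Q(1,0) and Q(1,2) listed for P(0,0), and every row holds four distinct destinations.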

Node contention can occur in the C_Transfer communication schedule if any one of
the following conditions is true:

(i) P_r > Q_r

(ii) P_c > Q_c

(iii) P_r > Q_r and P_c > Q_c

If there are node contentions in the communication schedule, create a Processor
Mapping (PM) table of dimension R × C and initialize it with the values from the
FDPC table. To reduce node contention, the PM table is circularly shifted by
rows or columns. To maintain data consistency, the same operations are performed on
the IDPC table and the superblock tables within the Layout array. The C_Transfer
table is created from the modified PM table. We identify three situations where node
contention can occur. Case 1 and Case 2 are applicable during both expansion and
shrinking of an application, while Case 3 can occur only when an application is
shrinking to a smaller destination processor set.

Do the following operations on IDPC, PM and on each 2-D table in the Layout array.
Case 1: If P_r > Q_r and P_c ≤ Q_c then

1. Create (R/P_r) groups with P_r rows in each group.

2. For 1 ≤ i < P_r, perform a circular right shift on each row i by P_c * i elements
in each group.

3. Create the C_Transfer table from the resulting PM table.
Case 2: If P_r ≤ Q_r and P_c > Q_c then

1. Create (C/P_c) groups with P_c columns in each group.

2. For 1 ≤ j < P_c, perform a circular down shift on each column j by P_r * j
elements in each group.

3. Create the C_Transfer table from the resulting PM table.
Case 3: If P_r > Q_r and P_c > Q_c then

1. Create (C/P_c) groups with P_c columns in each group.

2. For 1 ≤ j < P_c, perform a circular down shift on each column j by P_r * j elements
in each group.

3. Create (R/P_r) groups with P_r rows in each group.

4. For 1 ≤ i < P_r, perform a circular right shift on each row i by P_c * i elements in
each group.

5. Create the C_Transfer table from the resulting PM table.

The C_Recv table is not used when the schedule is not contention-free. Node
contention results in overlapping entries in the C_Recv table, thus rendering it
unusable.
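
Whether an unshifted schedule actually suffers from node contention can be checked directly from C_Transfer. The sketch below only detects contention; it does not apply the circular shifts of Cases 1 to 3. The helper names and the 4 × 2 to 2 × 4 example are assumptions chosen for illustration.

```python
from collections import Counter


def config_table(rows, cols, grid_rows, grid_cols):
    """Owner rank at each superblock position (i, j) for the given processor grid."""
    return [[grid_cols * (i % grid_rows) + (j % grid_cols) for j in range(cols)]
            for i in range(rows)]


def build_transfer(idpc, fdpc, n_src):
    """Unshifted C_Transfer built by a row-major walk over FDPC."""
    steps = (len(idpc) * len(idpc[0])) // n_src
    c_transfer = [[-1] * n_src for _ in range(steps)]
    counter = [0] * n_src
    for idpc_row, fdpc_row in zip(idpc, fdpc):
        for src, dst in zip(idpc_row, fdpc_row):
            c_transfer[counter[src]][src] = dst
            counter[src] += 1
    return c_transfer


def contended_steps(c_transfer):
    """Map step -> {destination: number of senders} for destinations hit more than once."""
    result = {}
    for step, row in enumerate(c_transfer):
        clashes = {dst: n for dst, n in Counter(row).items() if n > 1}
        if clashes:
            result[step] = clashes
    return result


# P = 4 x 2 and Q = 2 x 4 (P_r > Q_r), so R = 4 and C = 4.
schedule = build_transfer(config_table(4, 4, 4, 2), config_table(4, 4, 2, 4), n_src=8)
print(contended_steps(schedule))
```

In this example P_r > Q_r, and in the first step two sources target each of the destinations 0, 1, 4 and 5, which is the situation the row shifts are meant to reduce.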

Step 4: Data marshalling and unmarshalling

If a processor's rank equals the value at IDPC(i,j), then the processor collects the
data from the relative indexes of all the superblocks in the Layout array. Each
collection of data over all the superblocks forms a single message for communication
for processor j.

If there are no node contentions in the schedule, each source processor stores (R *
C)/(P_r * P_c) messages, each of size (N * N)/(R * C), in the original order of
the data layout. The messages received on the destination processor are unpacked
into individual blocks and stored at an offset of (R/Q_r) * (C/Q_c) elements from
the previous data block in the local array. The first data block is stored at the zeroth
location of the local array. If the communication schedule has node contentions,
the order of the messages is shuffled according to the row or column transformations.
In such cases, the destination processor performs reverse index computation and
stores the data at the correct offset.
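
The packing half of Step 4 can be illustrated for the contention-free case. The sketch below gathers block ids rather than block contents, assumes row-major block ids (x * N + y), and uses hypothetical helper names; unpacking on the destination side is not shown.

```python
def build_layout(n_blocks, R, C):
    """One R x C table of row-major block ids per superblock, superblocks in row-major order."""
    return [[[(sb_row * R + i) * n_blocks + (sb_col * C + j) for j in range(C)]
             for i in range(R)]
            for sb_row in range(n_blocks // R)
            for sb_col in range(n_blocks // C)]


def config_table(rows, cols, grid_rows, grid_cols):
    """Owner rank at each superblock position (i, j) for the given processor grid."""
    return [[grid_cols * (i % grid_rows) + (j % grid_cols) for j in range(cols)]
            for i in range(rows)]


def pack_messages(rank, idpc, layout):
    """One message per position owned by `rank`: that position's block in every superblock."""
    return [[table[i][j] for table in layout]
            for i in range(len(idpc))
            for j in range(len(idpc[0]))
            if idpc[i][j] == rank]


# N = 12, P = 2 x 2, Q = 3 x 4: R = 6, C = 4, six superblocks.
layout = build_layout(12, 6, 4)
idpc = config_table(6, 4, 2, 2)
messages = pack_messages(0, idpc, layout)
print(len(messages), messages[0])
```

Each source rank ends up with (R * C)/(P_r * P_c) = 6 messages of (N * N)/(R * C) = 6 blocks each, in the same order as its column of the communication schedule.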
Step 5: Data Transfer 

The message size in each send communication is equal to (N * N)/(R * C)
data blocks. Each row in the C_Transfer table corresponds to a single
communication step. In each communication step, the total volume of messages
exchanged between the processors is P * ((N * N)/(R * C)) data blocks. This volume
includes cases where data is locally copied to a processor without performing an
MPI_Send and MPI_Recv operation. In a single communication step j, a source
processor P_i sends the marshalled message to the destination processor given by
C_Transfer(j, i), where 0 ≤ j < C_TransferRows and 0 ≤ i < (P_r * P_c).
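The way a transfer table drives the communication steps can be sketched without MPI by walking the table and classifying each entry as a local copy or a send/receive pair. This is an illustrative sketch only: the table below is made up, the function name is hypothetical, and it assumes that the source and destination sets share ranks, so an entry equal to its own column index means the data stays on the same processor.

```python
def classify_steps(c_transfer):
    """Count local copies and send/receive pairs implied by a transfer table.

    Each row is one communication step; column i holds the destination rank
    for source rank i in that step.
    """
    copies = sends = 0
    for step in c_transfer:
        for src, dst in enumerate(step):
            if src == dst:
                copies += 1      # data stays on the same processor
            else:
                sends += 1       # would be an MPI_Send / MPI_Recv pair
    return copies, sends


# Hypothetical 2-step table for resizing from 2 to 4 processors.
table = [[0, 1],
         [2, 3]]
print(classify_steps(table))
```

For this made-up table the result is two copies and two send/receive pairs, which happens to agree with the (2, 4) row of Table 2; the actual schedule produced by the algorithm may order the entries differently.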

Data Transfer Cost. Every communication call using MPI_Send and MPI_Recv has an
associated latency overhead. Let λ denote this time to initiate a message, and let
t denote the time taken to transmit a message of unit size from the source to the
destination processor. Thus, the time taken to send a message from a source
processor in a single communication step is ((N * N)/(R * C)) * t. The total data
transfer cost for redistributing the data across destination processors is

C_TransferRows * (λ + ((N * N)/(R * C)) * t)
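The cost description above can be turned into a small calculator. The sketch below reads it as one start-up latency plus one message transmission per communication step, and it assumes the number of steps equals the (R * C)/(P_r * P_c) messages held by each source in the contention-free case. The function and parameter names are hypothetical, and the latency and per-unit time in the example are arbitrary illustrative numbers, not measurements.

```python
def total_transfer_cost(n_blocks, R, C, pr, pc, latency, unit_time):
    """Estimated data transfer cost: steps * (latency + message_size * unit_time)."""
    steps = (R * C) // (pr * pc)                   # assumed number of communication steps
    message_size = (n_blocks * n_blocks) / (R * C)  # data blocks per message
    return steps * (latency + message_size * unit_time)


# 12 x 12 blocks, 2 x 2 source grid, superblock of 2 x 6 elements.
print(total_transfer_cost(12, R=2, C=6, pr=2, pc=2, latency=2.0, unit_time=0.5))
```

Here 3 steps of 12-block messages give 3 * (2.0 + 12 * 0.5) = 24.0 time units. The model ignores contention, so it is a lower bound whenever the schedule is not contention-free.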

4 Experiments and Results 

This section presents experimental results which demonstrate the performance of our
two-dimensional block-cyclic redistribution algorithm. The experiments were
conducted on 50 nodes of a large homogeneous cluster (System X). Each node is a dual
2.3 GHz PowerPC 970 processor with 4GB of main memory. Message passing was done
using MPICH2 [19] over a Gigabit Ethernet interconnection network. We integrated
the redistribution algorithm into the resizing library and evaluated its performance
by measuring the total time taken by the algorithm to redistribute block-cyclic
matrices from P to Q processors. We present results from two sets of experiments.
The first set of experiments evaluates the performance of the algorithm for resizing
and compares it with the Caterpillar algorithm. The second set of experiments
focuses on the effects of processor topology on the redistribution cost. Table 1
shows all the possible processor configurations for various processor topologies.
Processor configurations for the one-dimensional processor topology
(1 x Q_r * Q_c or Q_r * Q_c x 1) are not shown in the table. For the two sets of
experiments described in this section, we have used the following matrix sizes:
2000 x 2000, 4000 x 4000, 6000 x 6000, 8000 x 8000, 12000 x 12000, 16000 x 16000,
20000 x 20000 and 24000 x 24000. A problem size of 8000 indicates the matrix
8000 x 8000. The processor configurations listed in Table 1 evenly divide the
problem sizes listed above.



Table 1. Processor configurations for various topologies

Topology              Processor configurations
--------------------  --------------------------------------------------------------
Nearly-square         1 x 2, 2 x 2, 2 x 3, 2 x 4, 3 x 3, 3 x 4, 4 x 4, 4 x 5, 5 x 5,
                      5 x 6, 6 x 6, 5 x 8, 6 x 8

Skewed-rectangular    1 x 2, 2 x 2, 2 x 3, 2 x 4, 3 x 3, 2 x 6, 2 x 8, 2 x 10, 5 x 5,
                      3 x 10, 2 x 18, 2 x 20, 2 x 24, 2 x 1, 3 x 2, 4 x 2, 6 x 2,
                      8 x 2, 10 x 2, 10 x 3, 18 x 2, 20 x 2, 24 x 2



4.1 Overall Redistribution Time 

Every time an application acquires or releases processors, the globally distributed
data has to be redistributed to the new processor topology. Thus, the application
incurs a redistribution overhead each time it expands or shrinks. We assume a
nearly-square processor topology for all the processor sizes used in this
experiment. The matrix stores data as double precision floating point numbers.
Figure 4(a) shows the overhead for redistributing large dense matrices for different
matrix sizes using our redistribution algorithm. Each data point in the graph
represents the data redistribution cost incurred when increasing the size of the
processor configuration from the previous (smaller) configuration. Problem sizes
8000 and 12000 start execution with 2 processors, problem sizes 16000 and 20000
start with 4 processors, and the 24000 case starts with 6 processors. The starting
processor size is the smallest size which can accommodate the data. The trend shows
that the redistribution cost increases with matrix size, but for a fixed matrix size
the cost decreases as we increase the number of processors. This makes sense because
for a small processor set, the amount of data per processor that must be transferred
is large. Also, the communication schedule developed by our redistribution algorithm
is independent of the problem size and depends only on the source and destination
processor set size.
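The argument about data per processor can be made concrete with a short calculation. The sketch below assumes 8-byte double precision elements and an even split of the matrix across processors. The function name is hypothetical, and the figures are per-processor data shares, not measured transfer volumes.

```python
def megabytes_per_processor(n, procs, bytes_per_element=8):
    """Share of an n x n double precision matrix held by each of `procs` processors, in MB."""
    return n * n * bytes_per_element / procs / 1e6


# Starting configurations mentioned in the text.
for n, procs in [(8000, 2), (12000, 2), (16000, 4), (20000, 4), (24000, 6)]:
    print(n, procs, megabytes_per_processor(n, procs))

# Same matrix, growing processor set: the per-processor share shrinks.
print([megabytes_per_processor(8000, p) for p in (2, 4, 8, 16)])
```

For a fixed matrix size the share per processor falls in proportion to the processor count, which is in line with the decreasing redistribution cost described above.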




Fig. 4. Redistribution overhead incurred while resizing using ReSHAPE: (a) expansion,
(b) shrinking.

Figure 4(b) shows the overhead cost incurred while shrinking large matrices from
P processors to Q processors. In this experiment, we assign the values for P from
the set {25, 40, 50} and Q from the set {4, 8, 10, 25, 32}. Each data point in the
graph represents the redistribution overhead incurred while shrinking at that
problem size. From the graph, it is evident that the redistribution cost increases
as we increase the problem size. Typically, a large difference between the source
and destination processor sets results in a higher redistribution cost. The rate at
which the redistribution cost increases depends on the sizes of the source and
destination processor sets. However, we note that a smaller destination processor
set size has a greater impact on the redistribution cost than the difference between
the processor set sizes. This is shown in the graph, where the redistribution cost
for shrinking from P = 50 to Q = 32 is lower than the cost when shrinking from
P = 25 to Q = 10 or from P = 25 to Q = 8.

Figures 5(a) and 5(b) compare the total redistribution cost of our algorithm and
the Caterpillar algorithm. We have not compared the redistribution costs with the
bipartite redistribution algorithm, as our algorithm assumes that data redistribution
from P to Q processors includes an overlapping set of processors from the source and
destination processor sets. The total redistribution time is the sum of the schedule
computation time, the index computation time, the time for packing and unpacking the
data, and the data transfer time. In each communication step, each sender packs a
message before sending it and the receiver unpacks the message after receiving it.
The Caterpillar algorithm does not attempt to schedule communication operations and
send equal-sized messages in each step. Figure 5(a) shows experimental results for
redistributing block-cyclic two-dimensional arrays from a 2 x 4 processor grid to a
5 x 8 processor grid. On average, the total redistribution time of our algorithm is
12.7 times less than that of the Caterpillar algorithm. In Figure 5(b), the total
redistribution time of our algorithm is about 32 times less than that of the
Caterpillar algorithm. In our algorithm, the total number of communication calls for
redistributing from 8 to 40 processors is 80, whereas in Caterpillar the number is
160. Similarly, the number of MPI communication calls in our algorithm for
redistributing a 2-D block-cyclic array from 8 processors to 50 processors is 196,
as compared to 392 calls in the Caterpillar algorithm.

Fig. 5. Comparing the total redistribution time for data redistribution in our
algorithm with the Caterpillar algorithm: (a) redistribution overhead while resizing
from 8 to 40 processors, (b) redistribution overhead while resizing from 8 to 50
processors.



4.2 Effects of Processor Topology on Total Redistribution Time 




In this experiment, we report the performance of our redistribution algorithm with
four different processor topologies: One-dimensional-row (Row-major),
One-dimensional-column (Column-major), Skewed-rectangular-row (P_r x P_c, P_r > P_c)
and Skewed-rectangular-column (P_r x P_c, P_r < P_c). The processor configurations
used for the Skewed-rectangular topologies are listed in Table 1. Figures 6(a) and
6(b) show the overhead for redistributing problem sizes 20000 and 24000,
respectively, across different processor topologies using our redistribution
algorithm. The total cost of redistributing a 20000 x 20000 matrix across a
one-dimensional topology is comparable to the total redistribution cost on a
nearly-square processor topology (see Figure 4(a)). In the case of
skewed-rectangular topologies, the total redistribution time is slightly higher
compared to the redistribution cost with nearly-square processor topologies. We ran
this experiment on other problem sizes (8000 x 8000 and 16000 x 16000) and observed
results similar to Figure 6(a). An increase in the total redistribution time for a
skewed-rectangular topology can be due to one of two situations.

(1) There is an increase in the total number of messages to be transferred using the
communication schedule.

(2) Node contention in the communication schedule is high.

Since the dimensions of a superblock depend upon the source and destination
processor rows and columns, a change in the processor topology can change the number
of elements in a superblock. As a result, the number of messages exchanged between
processors will also vary, thereby increasing or decreasing the total redistribution
time. Figure 6(b) shows that the total redistribution cost for a skewed processor
topology suddenly increases when the processor size increases from 30 to 36
(10 x 3 to 18 x 2). In this case the number of elements in a superblock increases to
540. Table 2 shows the total MPI send/receive counts for redistributing between
different processor sets on different topologies. From Table 2 we note that data
redistribution using a skewed-rectangular processor topology requires exactly half
the number of send/receive operations as compared to a nearly-square topology. The
algorithm uses only 18 MPI send/receive operations to redistribute data from 4 to 20
processors and 36 to redistribute from 8 to 40 processors, as compared to 36 and 72
respectively required for a nearly-square topology. In Figure 6(a) the cost of
redistribution in a P < Q topology is more than the redistribution cost for a P > Q
topology. The reason for this additional overhead can be attributed to an increased
number of node contentions in the communication schedule for the P < Q topology.
Node contention decreases as the processor size increases and the topology is
maintained in subsequent iterations. When data is redistributed from P = 25 (square
topology) to Q = 40 (skewed topology), node contentions in the communication schedule
of Q = 40 (10 x 4) are higher compared to the schedule for redistribution to
Q = 40 (4 x 10).
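The dependence of the schedule size on topology can be checked with a few lines of Python. The sketch assumes the superblock dimensions are R = lcm(P_r, Q_r) and C = lcm(P_c, Q_c). That definition is not restated in this section, but it reproduces the 540-element superblock quoted above for 10 x 3 to 18 x 2. It further assumes that the number of communication steps is (R * C)/(P_r * P_c) and that each of the P source processors performs one copy or send per step. The function name is hypothetical.

```python
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def schedule_size(pr, pc, qr, qc):
    """Return (superblock elements, communication steps, copies + send/recvs)."""
    R, C = lcm(pr, qr), lcm(pc, qc)      # assumed superblock dimensions
    steps = (R * C) // (pr * pc)         # assumed number of communication steps
    return R * C, steps, steps * pr * pc


print(schedule_size(10, 3, 18, 2))   # skewed, 30 -> 36 processors
print(schedule_size(2, 4, 5, 8))     # nearly square, 8 -> 40 processors
print(schedule_size(2, 4, 2, 20))    # skewed, 8 -> 40 processors (assumed grids)
```

Under these assumptions the nearly-square 8 to 40 case yields 10 steps and 80 operations, matching the 8 copies and 72 send/receives listed in Table 2, while a 2 x 4 to 2 x 20 skewed layout yields 5 steps and 40 operations, matching the 4 copies and 36 send/receives in the skewed column.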

Table 2. Counting topology-dependent Send/Recvs. (P, Q) = size of source and
destination processor set

Redistribution  Communication    Nearly square     1 Dimensional     Skewed-rectangle
configuration   steps            Copy  Send/Recv   Copy  Send/Recv   Copy  Send/Recv
(2, 4)          2                2     2           2     2           2     2
(4, 6)          3                3     9           4     8           3     9
(4, 8)          2                2     6           4     4           2     6
(6, 9)          3                6     12          6     12          3     15
(8, 16)         2                8     8           8     8           4     12
(9, 12)         4                6     30          9     27          3     33
(12, 16)        4                12    36          12    36          12    36
(16, 20)        5                10    70          16    64          16    64
(20, 25)        5                20    80          20    80          5     95
(25, 30)        6                15    135         25    125         4     146
(25, 40)        8                7     193         20    180         25    175
(30, 36)        6                30    150         30    150         15    525
(36, 48)        4                12    132         36    108         36    108
(4, 20)         10, 5 (skewed)   2     38          4     36          2     18
(8, 40)         10, 5 (skewed)   8     72          8     72          4     36
(8, 50)         25               8     192         8     192         8     192

5 Discussion and Future Work

In this paper we have introduced a framework, ReSHAPE, that enables parallel
message-passing applications to be resized during execution. We have extended the
functionality of the resizing library in ReSHAPE to support redistribution of 2-D
block-cyclic matrices distributed across a 2-D processor topology. We build upon the
work by Park et al. [16] to derive an efficient 2-D redistribution algorithm. Our
algorithm redistributes a two-dimensional block-cyclic data distribution on a 2-D
grid of P (P_r x P_c) processors to a two-dimensional block-cyclic data distribution
on a 2-D grid of Q (Q_r x Q_c) processors, where P and Q can be any arbitrary
positive values. The algorithm ensures a contention-free communication schedule if
P_r < Q_r and P_c < Q_c. For all other conditions involving P_r, P_c, Q_r, Q_c, the
algorithm minimizes node contention in the communication schedule by performing a
sequence of row or column circular shifts. We also show the ease of use of the API
provided by the framework to port and execute applications to make use of ReSHAPE's
dynamic resizing capability. Currently the algorithm can redistribute N x N blocks
of data on P processors to Q processors only if Q_r and Q_c evenly divide N so that
all the processors have an equal number of integer blocks. We plan to generalize
this assumption so that the algorithm can redistribute data between P and Q
processors for any arbitrary values of P and Q.

We are currently evaluating the ReSHAPE framework with different scheduling
strategies for processor reallocation, quality-of-service and advanced reservation
services. We are also working towards adding resizing capabilities to several
production scientific codes and adding support for a wider array of distributed data
structures and other data redistribution algorithms. Finally, we plan to make
ReSHAPE a more extensible framework so that support for heterogeneous clusters, grid
infrastructure, shared memory architectures, and distributed memory architectures
can be implemented as individual plug-ins to the framework.